Machine learning (ML) has made its way into nearly every field one can name. Among ML techniques, supervised learning has seen the highest success rate. The AI trinity (availability of massive data, better algorithms, and powerful computational infrastructure) is the major driving factor behind the success of supervised learning.
Supervised learning algorithms work with labeled data only. However, annotating data is expensive, because it may require: i) excessive time and manual effort, ii) expensive sensors. Let us understand how expensive labeling is with a few examples.
Let us say we want to convert speech or audio into text for an application like subtitle generation (a speech-to-text task). The annotator has to annotate each segment of the time window in the audio. The example audio of 6 seconds here may require around a minute to annotate manually. Thus, annotating millions of hours of publicly available audio data is nearly impossible.
Similarly, researchers have made efforts to detect COVID-19 from the sound of the human cough [AI4COVID-19], where again, getting the correct labels is difficult.
Let us take another example.
Let us say we want to classify different human activities into different categories (shown in the figure). Such tasks need expensive sensors to monitor the alignment or motion of the human body. In the end, mapping the sensor data to the various activities requires substantial manual effort.
We know that increasing the amount of training (labeled) data improves model performance. However, not all samples play an equal role in improving the model. Let us understand this with a few examples.
We use an artificial two-class dataset generated from a bivariate normal distribution. We train a Support Vector Classifier (SVC) model on a sample of this dataset (5 data points) and visualize the decision boundary.
Support vectors help the SVC model distinguish between the classes. Some instances in the above diagram are misclassified because we have used very few training points. Consider the candidate points A, B, C, and D from the unlabeled data. Points B and D are closer to the confusion region than points A and C. Thus, points B and D are more valuable for improving the model if added to the train points.
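The setup above can be sketched as follows. The class means, seed, and sample sizes here are illustrative assumptions, not the article's exact data; the distance from the SVC decision boundary (via `decision_function`) stands in for closeness to the confusion region:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two-class data drawn from bivariate normal distributions (means are assumptions)
X0 = rng.multivariate_normal([0, 0], np.eye(2), size=100)
X1 = rng.multivariate_normal([3, 3], np.eye(2), size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Train an SVC on only 5 labeled points (3 from class 0, 2 from class 1)
idx = np.concatenate([rng.choice(100, 3, replace=False),
                      100 + rng.choice(100, 2, replace=False)])
clf = SVC(kernel="linear").fit(X[idx], y[idx])

# Candidates with small |decision_function| lie near the decision boundary,
# i.e. in the confusion region, and are the most informative to label next
candidates = np.delete(np.arange(len(X)), idx)
dist = np.abs(clf.decision_function(X[candidates]))
best = candidates[np.argmin(dist)]
```

Labeling `best` (the closest candidate to the boundary) plays the role of adding point B or D above.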
Let us consider the MNIST dataset (a well-known public dataset with labeled images of digits $0$ to $9$) for a classification task. We train a logistic regression model on a few random samples of the MNIST dataset. Let us see what the model learns from a stratified set of $50$ data points ($5$ samples per class). We show the normalized confusion matrix over a test set of $10000$ samples.
We can see from the confusion matrix that some digits have more confusion than others. For example, the digit '1' has almost no confusion, while the digit '9' is confused with '7' and '4'. This is because some digits are harder to distinguish from the model's perspective. Thus, we may need more training examples of such digits for the model to learn them correctly. Next, we will see a regression-based example.
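A minimal sketch of this experiment, using scikit-learn's built-in 8x8 `load_digits` set as a stand-in for MNIST (an assumption; the article uses full MNIST and a 10000-sample test set):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
# Stratified 50-point train set: 5 samples per digit class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=50, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Row-normalized confusion matrix: row i shows how true digit i is classified
cm = confusion_matrix(y_test, model.predict(X_test), normalize="true")
print(np.round(cm, 2))
```

Inspecting the off-diagonal entries of `cm` shows which digit pairs the under-trained model confuses most.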
We consider sine curve data with added noise. We take a few samples (8 samples) as train points, a few as candidate points, and the rest as test points. Candidate points are potential train points.
We fit a Gaussian Process Regressor (GPR) model with the Matern kernel to our dataset. As output, the model gives a predictive mean along with a predictive variance. The predictive variance indicates how confident the model is about its predictions.
We can observe that the uncertainty (predictive variance) is higher at points distant from the train points. Let us consider the set {A, B, C, D} and check whether these points are equally useful to the model.
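This behavior can be reproduced with a short sketch; the seed, noise level, and pool size below are assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
# Noisy sine data; 8 points serve as the train set
X_pool = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y_pool = np.sin(X_pool).ravel() + rng.normal(0, 0.1, 100)

train_idx = rng.choice(100, size=8, replace=False)
gpr = GaussianProcessRegressor(kernel=Matern(), alpha=0.1)
gpr.fit(X_pool[train_idx], y_pool[train_idx])

# Predictive mean and standard deviation; std grows far from the train points
mean, std = gpr.predict(X_pool, return_std=True)
print("Highest uncertainty at x =", X_pool[np.argmax(std)].item())
```

Plotting `mean` with a band of `mean ± 2 * std` reproduces the uncertainty tube shown in the figure.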
One can argue, in terms of RMSE and predictive variance, that adding points {A, D} to the train set is better than adding points {B, C}. Note that adding points to the train set is the same as labeling unlabeled data and using it for training. One may either choose these 'good' points intelligently or randomly choose some points and label them. Active Learning techniques can help us determine the 'good' points, which are likely to improve our model. We will now discuss Active Learning techniques in detail.
Wikipedia defines Active Learning as follows,
The diagram below illustrates the general flow of Active Learning.
As shown in the flow diagram, the model selects a few samples from an unlabeled pool or distribution according to some criteria and sends them to the oracle (a human annotator or data source) for labeling. Because the samples are chosen intelligently, Active Learning is also known as optimal experimental design [link].
An ML model can randomly sample data points and send them to an oracle for labeling. Random sampling will also eventually capture the global distribution in the train points. However, Active Learning aims to improve the model with fewer labels by intelligently selecting the data points for labeling. Thus, random sampling is an appropriate baseline to compare against Active Learning.
We have mainly three different scenarios of Active Learning:
The pool-based sampling scenario suits most real-world applications. Considering space and time constraints, we restrict this article to pool-based sampling only.
If we already have a pool of unlabeled data, we can query data points with the following methods:
We will demonstrate each of the above strategies with examples in the subsequent sections.
Uncertainty sampling uses different approaches for classification and regression tasks. We will go through them one by one with examples here.
The MNIST dataset is a well-known dataset containing thousands of images of the digits 0 to 9. We show some examples here.
We now fit a Random Forest Classifier model (an ensemble model consisting of multiple Decision Tree Classifiers) on a few random samples (50 samples) and visualize the predictions. We then explain different ways to perform uncertainty sampling using these predictions.
Above are the model's predicted class probabilities for a few random test samples. We can use different uncertainty strategies as follows.
Least confident: In this method, we choose the samples for which the probability of the most probable class is minimum. In the above example, the model is least confident about sample 1's most probable class, digit '4'. So, we will choose sample 1 for labeling using this approach.
Margin sampling: In this method, we choose the samples for which the difference between the probability of the most probable class and that of the second most probable class is minimum. In the above example, sample 1 has the least margin; thus, we will choose sample 1 for labeling using this approach.
Entropy: For $N$ classes, entropy can be calculated using the following equation, where $P(x_i)$ is the predicted probability of the $i^{th}$ class. \begin{equation} H(X) = -\sum\limits_{i=1}^{N}P(x_i)\log_2 P(x_i) \end{equation} Entropy is higher when the probability is spread over all classes. Thus, we can say that if the entropy is high, the model is confused among all classes. In the above example, sample 2 has the highest entropy in its predictions, so we choose it for labeling.
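The three strategies can be written compactly as functions over a model's predicted probability matrix (one row per sample). The probability rows below are made-up for illustration; on these rows, all three strategies happen to pick the same sample:

```python
import numpy as np

def least_confident(probs):
    # Lowest top-class probability -> most uncertain sample
    return int(np.argmin(probs.max(axis=1)))

def margin(probs):
    # Smallest gap between the top two class probabilities
    top2 = np.sort(probs, axis=1)[:, -2:]
    return int(np.argmin(top2[:, 1] - top2[:, 0]))

def entropy(probs):
    # Highest Shannon entropy over the class distribution
    eps = 1e-12  # avoids log(0)
    H = -(probs * np.log2(probs + eps)).sum(axis=1)
    return int(np.argmax(H))

probs = np.array([[0.10, 0.50, 0.40],    # fairly peaked prediction
                  [0.34, 0.33, 0.33]])   # nearly uniform prediction
print(least_confident(probs), margin(probs), entropy(probs))  # -> 1 1 1
```

Each function returns the row index of the sample to query next; in an Active Learning loop, that sample would be sent to the oracle for labeling.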
We will now see the effect of Active Learning with these strategies on the test data (containing 10000 samples). We continue using the Random Forest Classifier model for this problem. We start with 50 samples as the initial train set and add 100 actively chosen samples over 100 iterations.
The above animation shows the per-digit and overall F1-scores after each iteration. We can see that each strategy, except random sampling, tends to choose more samples of the digit classes with lower F1-scores. Margin sampling performs best among the strategies in terms of F1-score. Margin sampling and the least confident method easily outperform the random baseline, while the entropy method, in this case, is comparable to the random baseline. The figure below compares all strategies.
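A condensed version of this loop with margin sampling might look like the following, again using `load_digits` as a stand-in for MNIST and querying one sample per iteration over fewer iterations (the sizes, seeds, and iteration count are assumptions):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=500, random_state=0)

rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X_pool), size=50, replace=False))  # initial train set

for _ in range(20):  # 20 querying iterations (the article runs 100)
    clf = RandomForestClassifier(random_state=0).fit(X_pool[labeled], y_pool[labeled])
    probs = clf.predict_proba(X_pool)
    top2 = np.sort(probs, axis=1)[:, -2:]
    margins = top2[:, 1] - top2[:, 0]
    margins[labeled] = np.inf            # never re-query an already-labeled point
    labeled.append(int(np.argmin(margins)))  # smallest margin -> query next

clf = RandomForestClassifier(random_state=0).fit(X_pool[labeled], y_pool[labeled])
f1 = f1_score(y_test, clf.predict(X_test), average="macro")
print("Macro F1 after active querying:", round(f1, 3))
```

Swapping the margin computation for the least confident or entropy criterion changes only the line that scores the pool.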
Thus far, we have seen uncertainty for classification tasks. Now, we will take an example of regression to explain uncertainty sampling.
We use the sine curve dataset from earlier for this task. We fit a Gaussian Process Regressor model with the Matern kernel on 5 randomly selected data points. The uncertainty measure for regression tasks is the standard deviation or the predictive variance; in this example, we take the predictive variance as our measure of uncertainty.
As per the uncertainty criterion, the samples with the highest predictive variance should be queried for a label. We now compare uncertainty sampling with the random baseline for ten iterations, also showing the next sample to query at each iteration.
The above animation compares uncertainty sampling with random sampling. One can observe that the samples chosen by uncertainty sampling are more informative to the model and ultimately help reduce the model's uncertainty (variance) and RMSE faster than random sampling.
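The querying loop itself is short: repeatedly fit the GPR, then label the pool point with the highest predictive uncertainty (the sizes, seed, and noise level below are assumptions):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(1)
X_pool = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y_pool = np.sin(X_pool).ravel() + rng.normal(0, 0.1, 100)

labeled = list(rng.choice(100, size=5, replace=False))  # 5 initial train points
for _ in range(10):                          # ten querying iterations
    gpr = GaussianProcessRegressor(kernel=Matern(), alpha=0.1)
    gpr.fit(X_pool[labeled], y_pool[labeled])
    _, std = gpr.predict(X_pool, return_std=True)
    std[labeled] = -1.0                      # exclude already-labeled points
    labeled.append(int(np.argmax(std)))      # query the most uncertain point
```

Tracking RMSE on held-out points at each iteration, and running the same loop with a random query instead of `argmax(std)`, reproduces the comparison in the animation.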
Now, we will discuss the Query by Committee method.
The Query by Committee (QBC) approach involves creating a committee of two or more learners or models. Each learner votes on samples in the pool set. The samples on which the committee members disagree the most are considered for querying. For classification tasks, we can take the mode of the learners' votes, and in regression settings, we can average the learners' predictions. The central intuition behind QBC is to minimize the version space: initially, the models hold different hypotheses, which converge as we query more samples.
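One common way to measure committee disagreement for classification is vote entropy. Below is a minimal sketch, assuming a committee of SVCs that differ only in their `gamma` hyperparameter (the article's committee construction may differ):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# 6 initial labeled samples (2 per class); the rest form the unlabeled pool
X_pool, X_init, y_pool, y_init = train_test_split(
    X, y, test_size=6, stratify=y, random_state=0)

# Committee members: same model class, different hyperparameters (an assumption)
committee = [SVC(kernel="rbf", gamma=g).fit(X_init, y_init)
             for g in (0.1, 1.0, 10.0)]

# Vote entropy: high where the members' predicted labels disagree
votes = np.stack([m.predict(X_pool) for m in committee])  # (members, samples)
vote_frac = np.stack([(votes == c).mean(axis=0) for c in range(3)], axis=1)
eps = 1e-12  # avoids log(0)
disagreement = -(vote_frac * np.log(vote_frac + eps)).sum(axis=1)
query_idx = int(np.argmax(disagreement))
```

The point at `query_idx` is sent to the oracle; after labeling, all members are refit on the enlarged train set and the loop repeats.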
We can set up a committee for QBC using the following approaches:
We explain the first approach using an SVC (Support Vector Classifier) model with the RBF kernel on the Iris dataset. We do not describe the other approaches due to space and time constraints.
We initially train the model on six samples and actively choose 30 samples from the pool set. We test the model performance at each iteration on the same test set of 30 samples.
The points queried by the committee are those on which the learners disagree the most, as can be observed in the above plot. We can see that the models initially learn different decision boundaries from the same data. Iteratively, they converge toward a similar hypothesis and thus start learning similar decision boundaries.
We now compare the overall F1-score of QBC against the random baseline. QBC outperforms the random baseline most of the time.
There are a few more Active Learning techniques that we do not cover in detail due to space and time constraints, but we describe them briefly here:
With this, we complete our visual tour of Active Learning techniques.